ABSTRACT

We study verification of systems whose transitions consist of 
accesses to a Web-based data-source. An access is a lookup 
on a relation within a relational database, fixing values for 
a set of positions in the relation. For example, a transition 
can represent access to a Web form, where the user is re- 
stricted to filling in values for a particular set of fields. We 
look at verifying properties of a schema describing the possi- 
ble accesses of such a system. We present a language where 
one can describe the properties of an access path, and also 
specify additional restrictions on accesses that are enforced 
by the schema. Our main property language, AccLTL, is 
based on a first-order extension of linear-time temporal logic, 
interpreting access paths as sequences of relational struc- 
tures. We also present a lower-level automaton model, A- 
automata, which AccLTL specifications can compile into. 
We show that AccLTL and A-automata can express static 
analysis problems related to "querying with limited access 
patterns" that have been studied in the database literature 
in the past, such as whether an access is relevant to an- 
swering a query, and whether two queries are equivalent in 
the accessible data they can return. We prove decidability 
and complexity results for several restrictions and variants 
of AccLTL, and explain which properties of paths can be 
expressed in each restriction. 

1. INTRODUCTION 

Many data sources do not expose either a bulk export 
facility or a query-based interface, enforcing instead many 
restrictions on the way data is accessed. For example, access 
to data may only be possible through Web forms, which 
require bindings for particular fields in the relation [16, 4]. 
Querying with limited access patterns also arises in other 
middleware contexts (e.g. federated access to data in Web 
services) as well as in construction of query interfaces on 
top of pre-determined indexed accesses [20]. For example, 
a Web telephone directory might allow several Web forms 
that serve as access methods to the underlying data. It may 
have an access method AcM1 accessing a relation

Mobile#(name, postcode, street, phoneno),

where AcM1 allows one to enter a mobile phone customer's
name (the underlined field) and access the corresponding set 
of tuples containing a postal code, mobile phone number and 
street name. The same site might have an access method 
AcM2 on relation 

Address(street, postcode, name, houseno)

allowing the user to enter a street name and postcode,
returning all corresponding resident names and house numbers.
Formally an access method consists of a relation and a
collection of input positions: for AcM1, position 1 is the sole
input position, while for AcM2 the first two positions are
input. An access consists of an access method plus a
binding for the input positions - for example putting "Smith"
into method AcM1 is an access. The response to an access
is a collection of tuples for the relation that agree with the 
binding given in the access. A schema of this sort defines a 
collection of access paths: sequences consisting of accesses 
and their responses. 
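
As a rough illustration of these definitions, the following Python sketch models an access method as a relation name plus a tuple of input positions, and computes the tuples that agree with a binding. The class name AccessMethod, the helper matching_tuples, the 0-based position encoding, and the sample data are assumptions made for illustration only; they are not part of the formal development.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessMethod:
    name: str               # e.g. "AcM1"
    relation: str           # relation the method accesses
    input_positions: tuple  # 0-based here (the text counts positions from 1)


def matching_tuples(instance, method, binding):
    """Tuples of the method's relation that agree with the binding
    on the input positions."""
    return {
        t
        for t in instance.get(method.relation, set())
        if all(t[pos] == val
               for pos, val in zip(method.input_positions, binding))
    }


# Hypothetical hidden instance for the telephone-directory example.
instance = {
    "Mobile#": {("Smith", "OX13QD", "Parks Rd", "5551212")},
    "Address": {("Parks Rd", "OX13QD", "Smith", 13),
                ("Parks Rd", "OX13QD", "Jones", 16)},
}
AcM1 = AccessMethod("AcM1", "Mobile#", (0,))
AcM2 = AccessMethod("AcM2", "Address", (0, 1))

# The access (AcM2, ("Parks Rd", "OX13QD")) and its complete response.
response = matching_tuples(instance, AcM2, ("Parks Rd", "OX13QD"))
```

Any subset of `response` would also agree with the binding, which is why a single access can have many possible responses.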

The impact of "limited access patterns" has thus been the 
subject of much study in the past decade. It is known that in 
the presence of limited access patterns, there may be no ac- 
cess path that completely answers the query, and there may 
also be many quite distinct paths. For example, the query 
Address(X, Y, "Jones", Z) asking for the address of Jones is
not answerable using the access methods AcM1 and AcM2
above. There are certainly many ways to obtain the max- 
imal answers: one could begin by obtaining all the street 
names and postcodes associated with Jones in the Mobile# 
table, entering these into the Address table to see if they 
match Jones, then taking all the new resident names we 
have discovered and repeating the process, until a fixedpoint 
is reached. If, however, Jones does not occur as a name 
in Mobile#, then this process will not yield Jones' tuple in
Address. In general it is known [15] that for any conjunctive
query one can construct (in linear time) a Datalog program 
that produces the maximal answers to a query under access 
patterns: the program simply tries all possible valid accesses 
on the database, as in the brute-force algorithm above. 
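
The brute-force strategy described above can be sketched in Python as a fixpoint loop. This is a minimal illustration, not the Datalog construction of [15]: it assumes every access returns all matching tuples, encodes an access method as a hypothetical pair (relation, 0-based input positions), and uses invented sample instances.

```python
from itertools import product


def maximal_answers(instance, methods, seeds, query):
    """Try every access whose inputs are already-known values, add the
    returned tuples, and repeat until nothing new is revealed."""
    known = set(seeds)  # values that may be used as inputs
    revealed = {relation: set() for relation, _ in methods}
    changed = True
    while changed:
        changed = False
        for relation, input_positions in methods:
            candidates = sorted(known, key=str)  # snapshot of known values
            for binding in product(candidates, repeat=len(input_positions)):
                for t in instance.get(relation, set()):
                    agrees = all(t[p] == v
                                 for p, v in zip(input_positions, binding))
                    if agrees and t not in revealed[relation]:
                        revealed[relation].add(t)
                        known.update(t)
                        changed = True
    return query(revealed)


methods = [("Mobile#", (0,)), ("Address", (0, 1))]


def jones_addresses(revealed):
    # The query Address(X, Y, "Jones", Z) over the revealed tuples.
    return {t for t in revealed["Address"] if t[2] == "Jones"}


with_jones = {
    "Mobile#": {("Jones", "OX13QD", "Parks Rd", "5551212")},
    "Address": {("Parks Rd", "OX13QD", "Jones", 16)},
}
without_jones = {
    "Mobile#": set(),
    "Address": {("Parks Rd", "OX13QD", "Jones", 16)},
}

found = maximal_answers(with_jones, methods, {"Jones"}, jones_addresses)
missed = maximal_answers(without_jones, methods, {"Jones"}, jones_addresses)
```

In the second instance Jones has no entry in Mobile#, so no street and postcode ever become known and the Address tuple stays hidden, matching the observation in the text.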

In the absence of a complete plan, how can we determine 
which strategy for making accesses is best? Recent works 
[4, 3] have proposed optimizing recursive plans, using access
pattern analysis to determine that certain kinds of accesses
cannot extend to a useful path. An example is the work in
[3] which proposes limiting the number of accesses to be ex- 
plored by determining that some accesses are not "relevant" 
to a query. An access is long term relevant if there is an
access path that begins with the access and uncovers a new
query result, where the removal of the access results in the
new result not being discovered. [3] gives the complexity of
determining relevance for a number of query languages.

Figure 1: Tree of possible paths associated with a schema.
(Figure not reproduced. Recoverable labels: the access
Mobile#("Smith", ...) leads to the known facts
Mobile#("Smith", OX13QD, "Parks Rd", 555 1212); the access
Address("Parks Rd", OX13QD, ?, ?) then leads to the known facts
Address("Parks Rd", OX13QD, "Smith", 13),
Address("Parks Rd", OX13QD, "Jones", 16),
Mobile#("Smith", OX13QD, "Parks Rd", 555 1212) ...)

Long term relevance is only one property that can be used 
to measure the value of making a particular access - for ex- 
ample we may want to know whether there is an access that 
reveals several values in the query result. Furthermore, "lim- 
ited access patterns" represent only one possible restriction 
that limits the possible access paths through a web interface. 
Many other restrictions may be enforced, e.g.:

• Restrictions that follow from integrity constraints on 
the data: e.g. a mobile phone customer name will not 
(arguably) overlap with a street name. Thus in an it- 
erative process for answering the query given above, 
we should not bother to make accesses to the Mobile#
table using street names we have acquired earlier in 
the process. It is also easy to see that key constraints, 
and more generally functional dependencies, can play 
a crucial role in determining whether an access is rel- 
evant. 

• Access order restrictions: e.g. before making any access
to Mobile#, the interface may require a web user
to have made at least one access to Address.

• Dataflow restrictions: e.g. before performing an access to
Mobile# on a name, the web user must have received
that name as a response to a call to Address.

Ideally, a query processor should be able to inspect an ac- 
cess and determine whether it is a good candidate for use, 
where the assumptions on the paths as well as the notion of 
"good candidate" could be specified on a per-application ba- 
sis. In this paper we look for a general solution to specifying 
and determining which accesses are promising: a language 
for querying the access paths that can occur in a schema. 
We show that every schema can be associated with a labelled 
transition system (LTS), with transitions for each access and 
nodes for each "revealed instance" (information known after 
a set of accesses). A fragment of the LTS for the schema 
with access methods AcM1 and AcM2 is given in Figure 1.
Paths through the LTS represent possible access/response 
sequences of the Web-based datasource. There are infinitely 
many paths - in fact every access could have many possible 
responses. But the access restrictions in the schema place 
limitations on what paths one can find in the LTS. We can 
then identify a "query on access paths" with a query over
this transition system. This work will provide a language
that allows the user to ask whether a given kind of path 
through instances of the schema is possible: e.g. is there 
a path that leads to an instance where a given conjunctive 
query holds, but where the path never uses access AcM1? Is
there a path that satisfies a given set of additional dataflow, 
access order restrictions, or data integrity constraints? 

Paths are often queried with temporal logic [13]. We will 
look at natural variations of First-Order Linear Temporal 
Logic (FOLTL) for querying access paths. We look at a 
family of languages denoted AccLTL(L) ("Access LTL"), pa- 
rameterized by a fragment L of relational calculus. It has 
a two-tiered structure: at the top level are temporal opera- 
tors ("eventually", "until") that describe navigation between 
transitions in a path. The second tier looks at a particular 
transition, where we have first-order (i.e. relational calcu- 
lus) queries that can ask whether the transitions satisfy a 
given property described in L. The relational vocabulary we 
consider for the "lower tier" will allow us to describe transi- 
tions given by accesses; it allows us to refer to the bindings 
of the access, the access method used, and the pre- and 
post-access versions of each schema relation. Consider the 
following AccLTL sentence: 

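$$(\neg \exists n\, \exists p\, \exists s\, \exists ph\; \mathrm{Mobile\#}_{pre}(n, p, s, ph)) \;\mathrm{U}\; (\exists n\; \mathrm{IsBind}_{AcM_1}(n) \wedge \exists s\, \exists p\, \exists h\; \mathrm{Address}_{pre}(s, p, n, h))$$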

The relational query prior to the "until" symbol U states
that there are no entries in Mobile#_pre - the Mobile# table
prior to the access. The query after the until symbol U
states that an access was done with method AcM1 and binding
n, where value n appeared in the Address table prior to
the access. Hence this "meta query" returns the set of access
paths which have no entries revealed in relation Mobile# until
an access AC is performed, where AC has method AcM1
and uses a name that already exists in the Address table.
In this work we will not be interested in returning all paths 
satisfying a query (there are generally infinitely many). We 
will check whether there is a path satisfying a given spec- 
ification. This is a question of satisfiability for our path 
query language. We may also want to check that every path 
through the system is of a certain form; this is the validity 
problem for the language - bounds for validity will follow 
from our results on satisfiability. 

We denote the logic containing the above sentence by
AccLTL(FO_Acc), where FO_Acc is the collection of positive
existential queries over a signature consisting of: the access
methods, bindings, and the pre- and post-access versions
of each relation used in a transition. AccLTL(FO_Acc)
can express a wide variety of properties. Unfortunately we
show that satisfiability for the logic is undecidable. However,
we show that a rich sublanguage of AccLTL(FO_Acc),
denoted AccLTL+, has a decidable satisfiability problem.
In AccLTL+ the formulas involving the bindings only occur
positively. We give bounds on the complexity of this
fragment, using a novel technique of reduction to contain- 
ment problems for Datalog. We then look at the exact com- 
plexity of smaller language fragments, and show that the 
complexity can go much lower - e.g. within the polynomial 
hierarchy. The main thing we give up in these languages is 
the ability to express dataflow restrictions. We also study 
the complexity and expressiveness of extensions of the lan- 
guages with inequalities and with branching time operators. 
In summary, our contributions are: 
• We present the first query language for reasoning about
the possible paths of accesses and responses that may 
appear in a Web form or other limited-access data- 
source. 

• We show that combining a natural decidable logic for 
temporal data (LTL) with conjunctive queries gives an 
undecidable path query language. 

• We show that by restricting to queries that are "bind- 
ing positive", we get a decidable path query language. 
In the process we introduce a new automaton model 
that corresponds to a process repeatedly querying a 
Web data source. We show that analysis of these 
"access automata" can be performed via reduction to 
(decidable) Datalog containment problems. The au- 
tomaton and logic specification languages are power- 
ful enough to express a rich set of data integrity con- 
straints, access order restrictions, and data flow re- 
strictions. 

• We show that the complexity of the logic can be de- 
creased drastically by restricting the ability to express 
properties of the bindings that occur in accesses. The 
resulting language can still express important access 
order and data integrity restrictions, but no dataflow 
restrictions. 

• We determine the impact of adding inequalities to the
relational query language, and of adding branching operators,
both in terms of expressing critical properties
of accesses and on complexity of verification.

Organization: Section 2 gives the basic definitions related
to access patterns, along with our family of languages
AccLTL(L). Section 3 gives our results about the full language
AccLTL(FO_Acc), while Section 4 deals with AccLTL+
and its restrictions. Section 5 discusses extensions of AccLTL+.
Section 6 gives conclusions and overviews related work. Most
proofs are deferred to the full paper.

2. DEFINITIONS 

Schemas and paths through a schema. Let Types be
some fixed set of datatypes, including at least the integers
and booleans. Our schemas extend traditional relational
schemas under the "unnamed perspective" [1]. A schema
Sch includes a set of relations $\{S_1 \ldots S_n\}$, with each $S_i$
associated with a function from $\{1 \ldots n_i\}$, where $n_i$ is the arity
of $S_i$, to Types. We refer to the set $\{1 \ldots n_i\}$ as the positions
of $S_i$ and the output of the function on j as the domain of the j-th
position. An instance I for the schema consists of a finite
collection $I(S_i)$ of tuples for each relation $S_i$, where a tuple
is a function from the positions of $S_i$ to the corresponding
domain.

A schema will also have a collection of access methods, 
where each method AcM is associated with a relation Si 
and a collection of input positions Inp(AcM). Informally,
each access method allows one to input a tuple of values for 
Inp(AcM) and get as a result a set of matching tuples. 

An access consists of an access method and a binding - a 
mapping taking the input positions of the method to their 
domains. A boolean access is one where the access method 
has as inputs every position of the relation - it is thus a mem- 
bership test. We will use an intuitive notation for accesses, 
often omitting the access method. Mobile#("Jones", ?, ?, ?)
is an access to relation Mobile# asking for all phone number
information for people named "Jones". A boolean access
is Mobile#("Jones", "OX13QD", "Parks Rd", 23)?, where we
add the ? to make clear it is an access.

Given an access (AcM, b), a well-formed output for AcM
(on instance I) is any set of tuples r in I in the relation of
AcM that is compatible with b on the input positions. We
also refer to this as a well-formed response. 

A sequence $((AcM_1, b_1), r_1), \ldots, ((AcM_n, b_n), r_n)$ of accesses
and well-formed responses for some instance I is an
access path for the instance I. We also refer to any sequence
of accesses and responses as an access path (without reference
to any instance). Note that every such sequence is an
access path for some instance - the instance containing all
returned tuples. Given an access path p and an initial instance
$I_0$, the configuration returned by p on $I_0$, Conf(p, $I_0$),
is the instance where relation $S_i$ contains $I_0(S_i)$ unioned
with all tuples returned by any access to $S_i$ in p. When $I_0$ is
empty or understood from context we refer to the instance
resulting from p, or Conf(p).
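
A minimal Python sketch of the configuration Conf(p, I0) may help fix the definition. It assumes a path is a list of ((relation, binding), response) pairs, abbreviating each access method by the relation it accesses; the function name conf and the sample path are illustrative assumptions.

```python
def conf(path, initial=None):
    """Conf(p, I0): the initial instance plus every tuple returned by
    some access along the path."""
    config = {relation: set(tuples)
              for relation, tuples in (initial or {}).items()}
    for (relation, _binding), response in path:
        config.setdefault(relation, set()).update(response)
    return config


# Hypothetical two-step path over the telephone-directory schema.
path = [
    (("Mobile#", ("Smith",)),
     {("Smith", "OX13QD", "Parks Rd", "5551212")}),
    (("Address", ("Parks Rd", "OX13QD")),
     {("Parks Rd", "OX13QD", "Smith", 13)}),
]
result = conf(path)  # empty initial instance
```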

As mentioned in the introduction, one is not interested in 
arbitrary paths, but those satisfying additional "sanity prop- 
erties". We allow our schemas to prescribe some common 
additional properties of access methods, while additional re- 
strictions can be expressed in the logics. The weakest prop- 
erty we consider here is called idempotence: an access path 
is idempotent if whenever the path repeats the same access, 
it obtains the same results. This corresponds to the require- 
ment that accesses are deterministic. A stronger property 
is that accesses are exact: an access path is exact on an 
instance I if for every access (AcM, b), the corresponding response
R contains exactly the tuples in the relation of AcM
which agree with b on the input positions. An access path is 
exact if it is exact for some input instance. Put another way, 
an exact access path is one that contains sound and com- 
plete views of the input data for all accesses made. Most 
web sources are not expected to be exact - an online music 
site will generally not contain information about all online 
music. However, some forms may be known to have canon- 
ical information - e.g. a web form accessing data from a 
trusted government agency. We allow situations which mix 
exact and non-exact accesses. In general, a schema may say 
that some access methods are exact, some are idempotent, 
and some are neither. Given a set of access methods S, we 
say that an access path is S-exact if there is an instance I
such that the path is exact for all accesses with methods in 
S, and similarly talk about S-idempotence. 
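
The two properties can be checked on a concrete path as in the following Python sketch. It assumes an access is encoded as a hashable triple (relation, 0-based input positions, binding), and it tests exactness only against one given instance, whereas the definition asks whether some such instance exists. All names and data are illustrative assumptions.

```python
def agrees(t, input_positions, binding):
    """True if tuple t matches the binding on the input positions."""
    return all(t[p] == v for p, v in zip(input_positions, binding))


def is_idempotent(path):
    """Repeating the same access must always yield the same response."""
    seen = {}
    for access, response in path:
        response = frozenset(response)
        if seen.setdefault(access, response) != response:
            return False
    return True


def is_exact_on(path, instance):
    """Every response contains exactly the matching tuples of the instance."""
    for (relation, input_positions, binding), response in path:
        expected = {t for t in instance.get(relation, set())
                    if agrees(t, input_positions, binding)}
        if set(response) != expected:
            return False
    return True


smith = ("Parks Rd", "OX13QD", "Smith", 13)
jones = ("Parks Rd", "OX13QD", "Jones", 16)
instance = {"Address": {smith, jones}}
access = ("Address", (0, 1), ("Parks Rd", "OX13QD"))

partial = [(access, {smith})]                            # sound, incomplete
full = [(access, {smith, jones})]                        # sound and complete
changing = [(access, {smith}), (access, {smith, jones})] # answers differ
```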

Finally, we often do not want paths in which values for 
access method inputs are "guessed", but are only interested 
in paths where the input to an access method is a value 
already known. Given an instance $I_0$ (representing the "initially
known information") an access path $p = a_1, r_1 \ldots$ is
grounded in $I_0$ if every value in the binding of $a_i$ occurs either in
$I_0$ or in a response from some $a_j$ with $j < i$. Groundedness
is a special kind of dataflow restriction - our largest logics 
will be able to specify groundedness, along with more spe- 
cialized dataflow restrictions, but we allow them also to be 
imposed in the schema. 
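
Groundedness of a concrete path can be checked with a single pass that tracks the values seen so far. The sketch below is illustrative only; it assumes a path is a list of ((relation, input positions, binding), response) pairs and an instance is a dictionary from relation names to sets of tuples.

```python
def is_grounded(path, initial):
    """Every value used in a binding must occur in the initial instance
    or in the response to an earlier access."""
    known = {v for tuples in initial.values() for t in tuples for v in t}
    for (_relation, _input_positions, binding), response in path:
        if any(v not in known for v in binding):
            return False
        known.update(v for t in response for v in t)
    return True


# Hypothetical data: "Smith" is known initially through the Address table.
initial = {"Address": {("Parks Rd", "OX13QD", "Smith", 13)}}
path = [
    (("Mobile#", (0,), ("Smith",)),
     {("Smith", "OX13QD", "Parks Rd", "5551212")}),
    (("Address", (0, 1), ("Parks Rd", "OX13QD")),
     {("Parks Rd", "OX13QD", "Jones", 16)}),
]
```

With an empty initial instance the same path is not grounded, because the first binding "Smith" would have to be guessed.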

A labelled transition system (LTS) is of the form (No, L, T)
where No is a collection of nodes, L is a collection of edge
labels, and T is a collection of transitions, that is, elements of
No × L × No. With any schema and initial instance $I_0$ we can
associate a labelled transition system where the nodes are all
the instances containing $I_0$ as a subinstance, the labels are
all the accesses, and there is a transition (I, AC, I') whenever
there is some response r to AC such that Conf((AC, r), I) =
I'. We can also consider the restricted LTS where we only 
allow paths with transitions (I, AC, I') in which the access 
AC is grounded at I, only paths that are idempotent, or 
only paths that are exact for a given subset of the access 
methods. 

Logics for querying access paths. To query paths it is 
natural to use Linear Temporal Logic (LTL) [13]. LTL formulas
define positions within a path. In Propositional LTL,
the positions within paths are associated with a propositional
model over some set of propositions, and one can then
build up formulas from the propositions using the modal operators
S (since), U (until), $X^{-1}$ (previously), X (next), and
F (eventually). For example $F(Q \wedge XP)$ holds on positions
i in a path p that come before some position j such that
proposition Q holds at j and proposition P holds on position
j + 1. We want to extend LTL to deal with access paths,
which are not just a sequence of propositional structures.
Each position in an access path consists of an access and its
response; the corresponding path through the LTS defined
above consists of transitions $t_1 \ldots t_n$, where a transition $t_i$ is
of the form $(I_i, (AcM_i, b_i), I_{i+1})$. There is obviously a one-to-one
correspondence between access paths and LTS paths
as above, and we will often identify them. Since the posi- 
tions carry with them a relational structure, we will use a 
variant of First Order Linear Temporal Logic (FOLTL) [13], 
which allows the use of first-order quantifiers and variables 
along with modal ones. We will deal here with a variant of 
FOLTL in which first-order sentences describing properties 
of positions can be nested inside temporal operators, but 
not vice versa. 

The embedded FO formulas have the ability to constrain 
the instance before the access as well as afterwards. Hence, 
for a given vocabulary Sch, we will consider formulas over
the relational vocabulary SchAcc consisting of two copies
$R_{pre}$, $R_{post}$ of each schema relation $R \in$ Sch. In addition
SchAcc contains predicates $\mathrm{IsBind}_{AcM}$ for each access method
AcM in Sch. The arity of $\mathrm{IsBind}_{AcM}$ is the number of input
positions of AcM. An LTS path $t_1 \ldots t_n$ is associated
with a sequence of SchAcc structures, where the i-th structure
$M(t_i)$, corresponding to $t_i = (I_i, (AcM_i, b_i), I_{i+1})$, interprets
each predicate $R_{pre}$ using the interpretation of R in $I_i$, and each
predicate $R_{post}$ as the interpretation of R in $I_{i+1}$. The predicate
$\mathrm{IsBind}_{AcM_i}$ holds of exactly the tuple $b_i$ while all other
predicates $\mathrm{IsBind}_{AcM}$ are empty.

We now introduce Access Linear Temporal Logic (AccLTL 
for short), our main specification formalism. 

Definition 2.1. Let L be a subset of first-order logic over
SchAcc. The logic AccLTL(L) has as atomic formulas every
sentence of L, and is built up by the usual LTL constructors:

$$\neg\varphi \mid \varphi \vee \psi \mid \varphi \wedge \psi \mid X\varphi \mid \varphi\, U\, \psi$$

The semantics of AccLTL(L) is given by the relation $(p, i) \models \varphi$,
where $p = t_1 \ldots t_n$ is an LTS path and $i \le n$. It combines
the standard semantics of L formulas with the usual rules
for the constructors of LTL:
1. $(p, i) \models \varphi$ iff $\varphi \in L$ and $M(t_i)$ satisfies $\varphi$ in the usual sense of first-order logic.
2. $(p, i) \models \neg\varphi$ iff $(p, i) \not\models \varphi$.
3. $(p, i) \models X\varphi$ iff $(p, i + 1) \models \varphi$.
4. $(p, i) \models \varphi\, U\, \psi$ iff there exists $j \ge i$ with $(p, j) \models \psi$ and, for all $i \le k < j$, $(p, k) \models \varphi$.
5. $(p, i) \models \varphi \vee \psi$ iff $(p, i) \models \varphi$ or $(p, i) \models \psi$.

In the rest of the paper, we make use of the temporal operators
G ("globally") and F ("eventually"). These operators
can be expressed using X and U as usual in LTL. The language
of a formula $\varphi$ is the set of paths p such that $(p, 1) \models \varphi$.
Our main language of interest is AccLTL(FO_Acc), where
FO_Acc consists of all positive existential FO sentences over
the signature SchAcc.
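
To make the semantics concrete, here is a small Python sketch of an evaluator for AccLTL formulas over a finite path. It is an illustration under several assumptions: formulas are nested tuples, atomic sentences are supplied as Python predicates over a transition rather than as first-order sentences, positions are 0-based, "until" is read in the standard non-strict way, and each transition is a hypothetical dictionary with keys "method", "binding", "pre" and "post".

```python
def holds(formula, path, i):
    """Decide (p, i) |= formula for a finite path p (0-based position i)."""
    op = formula[0]
    if op == "atom":
        return formula[1](path[i])
    if op == "not":
        return not holds(formula[1], path, i)
    if op == "or":
        return holds(formula[1], path, i) or holds(formula[2], path, i)
    if op == "and":
        return holds(formula[1], path, i) and holds(formula[2], path, i)
    if op == "X":
        return i + 1 < len(path) and holds(formula[1], path, i + 1)
    if op == "U":
        for j in range(i, len(path)):
            if holds(formula[2], path, j):
                return True   # right operand reached, left held before j
            if not holds(formula[1], path, j):
                return False  # left operand failed before the right held
        return False
    raise ValueError(f"unknown operator: {op}")


TRUE = ("atom", lambda t: True)


def F(f):
    return ("U", TRUE, f)           # eventually


def G(f):
    return ("not", F(("not", f)))   # globally


def no_mobile_before(t):
    # No tuple has been revealed in Mobile#_pre.
    return not t["pre"].get("Mobile#")


def acm1_on_known_name(t):
    # An AcM1 access whose name already occurs in Address_pre.
    names = {row[2] for row in t["pre"].get("Address", set())}
    return t["method"] == "AcM1" and t["binding"][0] in names


addr = ("Parks Rd", "OX13QD", "Smith", 13)
mob = ("Smith", "OX13QD", "Parks Rd", "5551212")
t0 = {"method": "AcM2", "binding": ("Parks Rd", "OX13QD"),
      "pre": {}, "post": {"Address": {addr}}}
t1 = {"method": "AcM1", "binding": ("Smith",),
      "pre": {"Address": {addr}},
      "post": {"Address": {addr}, "Mobile#": {mob}}}
guessed = [{"method": "AcM1", "binding": ("Smith",),
            "pre": {}, "post": {"Mobile#": {mob}}}]

# The "until" sentence from the introduction, with atoms as predicates.
sentence = ("U", ("atom", no_mobile_before), ("atom", acm1_on_known_name))
```

Here `holds(sentence, [t0, t1], 0)` evaluates the sentence at the first transition; on the path `guessed` the name is used before it appears in Address, so the sentence fails there.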

Example 2.2. [3, 5] study query containment under (in our
terminology, grounded) access patterns. Saying that query $Q_1$ is
contained in $Q_2$ relative to a schema with access patterns means
that for every grounded access path p, if the configuration
resulting from p satisfies $Q_1$, then it also satisfies $Q_2$. Informally,
the facts about $Q_1$ that we can determine given
the schema restrictions are contained in the facts we can
determine about $Q_2$. Using a containment algorithm, one
can perform query minimization in the presence of access 
restrictions. 

In [5] containment under access restrictions is shown to
be decidable for conjunctive queries, while [3] studies the
complexity of the problem. One can see that $Q_1$ is contained
in $Q_2$ under grounded access patterns iff the following
AccLTL(FO_Acc) formula is a validity (over grounded paths):



$$G(\neg Q_1^{pre} \vee Q_2^{pre})$$

Here $Q_i^{pre}$ is obtained from $Q_i$ by replacing each schema
predicate S by $S_{pre}$ (one could as easily use $S_{post}$). We will
show that containment under grounded access patterns can
be expressed in a restricted fragment of AccLTL(FO_Acc),
as well as in an automaton-based specification formalism 
where validity relative to grounded access paths is decidable 
in 2EXPTIME. Our results will thus give tight bounds for 
containment under grounded access patterns. 

Example 2.3. A boolean access $AC_1$ is said to be long term
relevant [3] (LTR) for a query Q on an initial instance $I_0$
if there is an access path $p = AC_1, r_1, AC_2, r_2 \ldots$ such that
the configuration I resulting from applying p to $I_0$ satisfies
Q, and the path with $AC_1$ dropped (i.e. $AC_2, r_2 \ldots$)
leads to a configuration where Q
does not hold. In the terminology of [3] we say it is LTR
under grounded accesses if there is a grounded access path
satisfying the above.

This property can be expressed in AccLTL(FO_Acc) in the
following sense: for each $I_0$, $AC_1 = (AcM_1, b_1)$, and Q there
is an AccLTL(FO_Acc) formula $\varphi$ which is satisfiable iff $AC_1$
is LTR. Below we give the formula for $I_0$ being the empty
instance:

$$F(\neg Q^{pre} \wedge \mathrm{IsBind}_{AcM_1}(b_1) \wedge Q^{post})$$

The formula checks that there is a path p and a response $r_1$
to $AC_1$, such that Q does not hold after p but holds after p, $AC_1, r_1$.
But for a boolean access $AC_1$, the instance after p, $AC_1, r_1$
is the same as the one after $AC_1, r_1$, p.
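
On a single concrete path, the formula above amounts to finding one transition that uses the given access and flips Q from false to true. The Python sketch below shows only that check; deciding whether such a path exists is the satisfiability problem and is not addressed here. The transition encoding (dictionaries with keys "method", "binding", "pre", "post"), the method name "AcM3", and the sample query are assumptions made for illustration.

```python
def witnesses_ltr(path, method, binding, query):
    """True if some transition uses the given access, the query fails on
    the pre-instance, and the query holds on the post-instance."""
    return any(
        t["method"] == method
        and t["binding"] == binding
        and not query(t["pre"])
        and query(t["post"])
        for t in path
    )


def jones_has_address(inst):
    # Q: some Address tuple has resident name "Jones".
    return any(row[2] == "Jones" for row in inst.get("Address", set()))


# Hypothetical boolean access: all four positions of Address are inputs.
jones = ("Parks Rd", "OX13QD", "Jones", 16)
path = [{"method": "AcM3", "binding": jones,
         "pre": {}, "post": {"Address": {jones}}}]
empty_answer = [{"method": "AcM3", "binding": jones,
                 "pre": {}, "post": {}}]
```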

As mentioned in the introduction, we often want additional
data integrity restrictions to hold on the path. In
AccLTL(FO_Acc), we can add on many data integrity restrictions,
such as the disjointness of names from streets, which
would be expressed by a conjunction of several formulas,
including:

$$G(\neg \exists n\, \exists p\, \exists s\, \exists ph\, \exists hn\, \exists n'\, \exists pc\; \mathrm{Mobile\#}_{pre}(n, p, s, ph) \wedge \mathrm{Address}_{pre}(n, pc, n', hn))$$
Similarly we can add access order restrictions and dataflow 
restrictions. For example, the following would restrict to 
paths in which names input to Mobile# must have appeared
previously in Address:

$$G((\exists n\; \mathrm{IsBind}_{AcM_1}(n)) \rightarrow \exists n\, \exists s\, \exists hn\, \exists pc\; \mathrm{IsBind}_{AcM_1}(n) \wedge \mathrm{Address}_{pre}(s, pc, n, hn))$$

Example 2.4. (Data integrity restrictions, continued) Let
Sch be a schema that includes, in addition to the access
methods, a set of functional dependencies $d_i = R^i : pos_i \rightarrow a_i$,
where $pos_i$ are positions of $R^i$ and $a_i$ is a position of $R^i$. We
say that an access AcM is long-term relevant for Q under
Sch if there is an instance $I \supseteq I_0$ satisfying all the FDs and
an access path that reveals Q to be true, as in Example 2.3,
but where each response returns only tuples in I.

This can be expressed in AccLTL($L_{\neq}$), where $L_{\neq}$ is the
set of conjunctive queries with inequalities.

$$F(\neg Q^{pre} \wedge \mathrm{IsBind}_{AcM}(b_1) \wedge Q^{post}) \wedge \bigwedge_i \neg F\Big[\exists \vec{y}\, \exists \vec{y}'\; R^i_{pre}(\vec{y}) \wedge R^i_{pre}(\vec{y}') \wedge \bigwedge_{k \in pos_i} y_k = y'_k \wedge y_{a_i} \neq y'_{a_i}\Big]$$

where $Q^{pre}$ and $Q^{post}$ are defined as in the previous example.
We will look at languages with inequalities in Section 5.



Basic Computational Problems. The basic problem we
consider is satisfiability of a sentence $\varphi$, which by default
means that there is some access path p such that $(p, 1) \models \varphi$.
We will also consider satisfiability over grounded, idempotent,
and (S-)exact paths.

3. AN EXPRESSIVE LANGUAGE FOR ACCESS RESTRICTIONS

Since satisfiability for first-order logic is undecidable, it 
is clear that AccLTL(FO) has an undecidable satisfiability 
problem. Our first main result is that the same holds even 
when first-order formulas are restricted to be existential. 

Theorem 3.1. Satisfiability of AccLTL(FO_Acc) is undecidable.

This is surprising, in that AccLTL(FO_Acc) formulas deal
with a fixed set of existential sentences on the configuration, 
and as a path progresses these queries can only move from 
false to true as more tuples are exposed by accesses. 

The proof is by reduction from the problem of determining
whether a collection $\Gamma$ of functional dependencies (FDs)
and inclusion dependencies (IDs) implies another functional
dependency $\sigma$. Since this problem is known to be undecidable
[6], it suffices to reduce it to unsatisfiability of an
AccLTL(FO_Acc) formula.

The difficulty here is that functional dependencies seem
to require negation inside a universal quantification, while
inclusion dependencies require quantifier alternation - in
AccLTL(FO_Acc) we have only boolean combinations of positive
formulas. We now explain the main idea involved in
bridging this gap, which will also be used in later undecidability
arguments (Theorem 5.2). The schema for our
accesses includes a successor relation of a total order over
the tuples of each relation in $\Gamma \cup \{\sigma\}$. The successor relation
is "created" via accesses - that is, we perform accesses that
reveal associations between a tuple and its successor. For
each relation R mentioned in $\Gamma \cup \{\sigma\}$ we also have relations
Beg(R) and End(R). Our formula will enforce that these
contain the first and the last tuples in the total order,
respectively, by asserting the existence of additional accesses
to these relations that reveal the first and last tuple. After
all the relations are filled, the satisfaction of the different FDs
and IDs in $\Gamma$ and the failure of $\sigma$ are verified. The verification
of the dependencies makes use of the successor relation,
and we explain the idea for FDs. We verify a dependency for
one tuple at a time, iterating on the tuples according to the
order. We will use a new predicate Chk(R) whose arity
is twice the arity of R. This predicate will have a boolean
access. Chk(R)(t, t') holding at some instance indicates
that the pair t, t' is in accordance with the FDs on R. This will be
done in a "nested loop" (a pair of nested "untils" in the logic)
in which we iterate first over tuples t, then over tuples t',
accessing them progressively within Chk(R). At every
access, we check whether the FD is satisfied, and if it is we
continue the iteration.

4. VERIFIABLE SPECIFICATIONS: THE POSITIVE TRANSITION SUBLANGUAGE

The undecidability proof for AccLTL(FO_Acc) makes use of
the ability of the logic to enforce that an access is made to
a binding that does not satisfy a certain relation. We now
consider a restriction of AccLTL(FO_Acc) which adds an additional
monotonicity condition. An AccLTL(FO_Acc) formula
$\varphi$ is binding-positive if every atom of the form IsBind(w) occurs
only positively in $\varphi$ - that is, under an even number of
negations.

Definition 4.1. The logic AccLTL+ is the set of binding-positive
formulas in AccLTL(FO_Acc).

Note that in AccLTL+ we can describe the most basic
dataflow constraint, the property of an access path being
grounded: an access path is grounded iff for every transition in the
path, every value that occurs in a binding also occurs in
some relation in the instance prior to the access:

$$G\Big(\exists \vec{x}\; \mathrm{IsBind}_{AcM}(x_1 \ldots x_m) \wedge \bigwedge_{i \le m} \bigvee_{R \in Sch} \exists \vec{y}\; R_{pre}(y_1 \ldots y_n) \wedge \bigvee_{j \le n} y_j = x_i\Big)$$

Thus we can reduce satisfiability over grounded instances to 
satisfiability over all instances. Furthermore all the exam- 
ples in the introduction are expressible in this fragment; we 
can express relevance of an access to a query as well as con- 
tainment of queries under access patterns, restricting the 
paths to satisfy many data integrity, dataflow, and access 
ordering restrictions. 

Our next main result is that this restriction suffices to give 
decidability: 

Theorem 4.2. Satisfiability of AccLTL+ is decidable in
3EXPTIME. The same is true for satisfiability over grounded 
instances and satisfiability over idempotent and exact ac- 
cesses. 

We will show Theorem 4.2 by going through another spec- 
ification formalism of interest in its own right, a natural au- 
tomaton model for access paths. These are Access-automata 
(A-automata for short), which run over access paths, using a
finite set of control states. At each transition (I, (AcM, b), I')
of an access path the evolution function of the automaton 
tells what new states (if any) it can move to at the next 
position. The evolution function is a relational query that 
makes use of the binding, pre- and post- condition of the 
transition. 

Definition 4.3 (A-Automaton). Let Sch be a schema,
SchAcc the corresponding schema with accesses (as defined in
Section 2), and C a set of constants. An Access-automaton
(A-automaton for short) over (Sch, C) is a tuple $(S, s_0, F, \delta)$
where

• S is a finite set of states, $s_0 \in S$ is an initial state,
$F \subseteq S$ is a set of accepting states

• $\delta$ is a finite set of tuples of the form $(s, \psi^- \wedge \psi^+, s')$
where s, s' are states, $\psi^-$ is a positive boolean combination
of negated FO_Acc sentences that cannot mention
the predicate IsBind, while $\psi^+$ is an FO_Acc sentence; all
these formulas can use constants in the given set C.

Semantics. Let $A = (S, s_0, F, \delta)$ be an A-automaton and let $\rho$ be a path $t_1 \ldots t_n$ through the LTS associated with Sch, where $t_i = (I_i, (AcM_i, \bar{b}_i), I_{i+1})$. A run of A on $\rho$ assigns to every $t_i$ a $\delta_i$ of the form $(s_i, \varphi_i, s_{i+1})$ in $\delta$ so that the relational structure $M(t_i)$ associated with $t_i$ satisfies $\varphi_i$. A run of A is said to be accepting iff its first state is initial and its last state is accepting. The language L(A) accepted by an A-automaton A is the set of access paths for which there is an accepting run. Note that an automaton only accepts access paths, which by definition must satisfy at least the property that for each i, $I_{i+1}$ extends $I_i$ solely by adding tuples to the relation of $AcM_i$, and all tuples added are consistent with the binding on the input positions of $AcM_i$. The definition of L(A) can be further qualified to account for other sanity conditions (e.g. exactness).
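
To make the run semantics concrete, here is a minimal Python sketch that decides whether a given finite path has an accepting run. It rests on assumptions that are not part of the definition above: each guard is supplied as an arbitrary Python predicate standing in for the sentence $\psi^- \wedge \psi^+$, a transition is encoded as (pre-instance, (method, binding), post-instance), and the path is assumed to be a valid access path already (this is not checked). The function name `has_accepting_run` and the example automaton are hypothetical.

```python
from typing import Callable, List, Set, Tuple

# A guard is evaluated on the structure of a single transition,
# encoded here as (pre_instance, (method, binding), post_instance).
Guard = Callable[[tuple], bool]
Rule = Tuple[str, Guard, str]          # (s, guard, s')

def has_accepting_run(initial: str, accepting: Set[str],
                      rules: List[Rule], path: list) -> bool:
    """Track the set of states reachable after each transition of the path."""
    current = {initial}
    for t in path:
        current = {s2 for (s1, guard, s2) in rules
                   if s1 in current and guard(t)}
        if not current:
            return False               # no run can be extended
    return bool(current & accepting)

# Hypothetical automaton: accept once some transition has a non-empty R_post.
rules = [
    ("s0", lambda t: True, "s0"),
    ("s0", lambda t: len(t[2]["R"]) > 0, "s1"),   # stands for: exists x. R_post(x)
    ("s1", lambda t: True, "s1"),
]
empty = {"R": set()}
filled = {"R": {(1,)}}
silent_path = [(empty, ("m", ()), empty)]
revealing_path = [(empty, ("m", ()), filled)]
```

Because the automaton is nondeterministic, the sketch follows all runs at once by keeping a set of current states rather than a single state.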

A-automata are powerful enough to directly express rel- 
evance of an access in the presence of dataflow restrictions 
as well as disjointness constraints: 

Proposition 4.4. Let Q and Q' be two positive queries, 
ACS a set of access methods, and S a set of disjointness 
constraints. One can efficiently produce an A-automaton A 
such that Q is contained in Q' under limited access patterns 
with disjointness constraints iff the language recognized by A 
is empty. A similar statement holds for long-term relevance 
of an access to Q under disjointness constraints. 

The proposition above can be extended to a general result stating that high-level logical specifications can be compiled into A-automata. We say that an A-automaton A is equivalent to an AccLTL sentence $\varphi$ if the language of $\varphi$ is the same as the language of A. The following result shows that each AccLTL+ formula can be converted into an A-automaton.

Lemma 4.5. For each AccLTL+ formula $\varphi$ there is an equivalent A-automaton of size exponential in the size of $\varphi$.

We will show that emptiness of A-automata is decidable. 
Note that this decidability result together with Lemma 4.5 
completes the proof of Theorem 4.2. Again, there are vari- 
ants of the theorem for the various types of access, but we 
focus on the case of general accesses in the body of the pa- 
per. 



Theorem 4.6. Emptiness of A-automata is decidable in 
2EXPTIME. The same holds if accesses are restricted to be 
exact or idempotent. 

Notice that from Theorem 4.6 and Proposition 4.4 we get 
a 2EXPTIME upper bound for containment and long-term 
relevance. This improves on the prior known bounds [3, 5]. 
The proof uses a tight connection between A-automata 
and the containment problem for Datalog queries within 
positive first-order queries. This connection can also be ex- 
ploited to give a corresponding lower bound: 

Theorem 4.7. Emptiness of A-automata and satisfiability of AccLTL+ are both 2EXPTIME-hard.

4.1 Automata, Datalog, And Proof Sketch of Theorem 4.7

The proof of this result makes use of some new tools that 
we overview here. We reduce the emptiness problem for 
A-automata to the problem of whether a Datalog program 
is contained within a positive first-order query. Roughly 
speaking, we show that these automata can be captured 
by a conjunction of a Datalog query and the negation of a 
union of conjunctive queries. The reduction to this prob- 
lem involves several stages, and the first step goes through 
a syntactic subclass of A-automata, called "progressive A- 
automata", defined below. We will show that the problem of 
testing emptiness of A-automata can be reduced to check- 
ing the emptiness of a bounded number of progressive A- 
automata. 

Progressive Automata. In the following, given a boolean combination $\varphi$ of $FO^{\exists+}_{Acc}$ formulas, we denote by $\hat{\varphi}$ the formula $\exists \bar{x}\, \varphi'$, where $\varphi'$ is obtained from $\varphi$ by replacing each atom $\mathrm{IsBind}_{AcM}(\bar{t})$ by $\bar{t} = \bar{x}$ and by replacing each predicate $R_{pre}$ by $R_{post}$. For a set $\Phi$ of sentences, we say that a formula is a complete $\Phi$-type if it is a conjunction that contains every formula of $\Phi$ either positively or negated. A formula is a "pure pre" (resp. "pure post") formula if it only mentions predicates of the form $R_{pre}$ (resp. $R_{post}$).

Definition 4.8 (Progressive A-automaton). An A-automaton $A = (S, s_0, F, \delta)$ over (Sch, C) is progressive if there is a pure pre formula $\tau_{pre}(s_0)$ that does not use the predicate $\mathrm{IsBind}_{AcM}$, a set $\Phi$ of pure post $FO^{\exists+}_{Acc}$ sentences, and a function $\tau_{post}$ mapping the states of A to complete $\Phi$-types such that:

1. For any transition $(s, \varphi, s')$, if both $\mathrm{IsBind}_{AcM}(\bar{t})$ and $\mathrm{IsBind}_{AcM'}(\bar{t}')$ are atoms in $\varphi$, then AcM = AcM'.

2. For any transition $(s, \varphi, s')$, $\varphi$ implies $\tau_{post}(s')$.

3. For any transition $(s_0, \varphi, s')$ that leaves the initial state, $\varphi$ implies $\tau_{pre}(s_0)$.

4. For any transition $(s, \varphi, s')$ for which s and s' are in the same strongly connected component, $\tau_{post}(s)$ is equivalent to $\tau_{post}(s')$; also $\tau_{post}(s')$ implies $\varphi$.

5. The maximal strongly connected components of A form a sequence $C_1, \ldots, C_h$. That is, for each i < h, there is exactly one transition $(s, \varphi, s')$ such that $s \in C_i$ and $s' \in C_{i+1}$. For such a transition that connects two maximal strongly connected components, all atoms of the form $\mathrm{IsBind}_{AcM}(\bar{t})$ must not contain variables; that is, $\bar{t}$ must be a sequence of constants.

6. The initial state is in $C_1$ and all accepting states are in $C_h$.

We will call h the height of A. A-automata correspond, up to emptiness, to unions of progressive automata.

Lemma 4.9. For every A-automaton A, there are progressive A-automata $A_1, \ldots, A_n$ such that, for each $i \le n$, the size of $A_i$ is polynomial in the size of A, n is exponential in the size of A, and L(A) is empty iff $L(A_1) \cup \ldots \cup L(A_n)$ is empty.

From progressive A-automata to containment of Datalog in Positive Queries. We now proceed to show that emptiness of progressive A-automata is decidable. Together with Lemma 4.9 this implies the decidability of emptiness for (general) A-automata. This will involve reducing the emptiness of a progressive A-automaton to the problem of whether a Datalog program is contained in a positive first-order logic sentence. Recall that a Datalog program is defined with respect to two database schemas, called the extensional schema and the intensional schema. A Datalog program P is a finite set of rules of the form "head :- body" where head is an atomic formula $R(\bar{x})$ with a relation symbol R in the intensional schema, and where body is a conjunctive query that can use relation symbols from the intensional and the extensional schema. Each Datalog program P contains a distinguished goal predicate Q. We use the standard notion of the least fixedpoint of a Datalog program P on a database D (see [1]), and we denote this fixedpoint by P(D). We say that a Datalog program P accepts a database D if the goal predicate of P is not empty in P(D).
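
For readers less familiar with Datalog, the notions of least fixedpoint and acceptance recalled above can be sketched with naive bottom-up evaluation. This is only an illustration under assumptions of our own: terms starting with "?" are variables, every other term is a constant, rules are assumed safe (each head variable occurs in the body), and the reachability program at the end is a hypothetical example, not one of the programs constructed in the proof.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match(atom_args, fact, subst):
    """Extend subst so that atom_args maps onto fact, or return None."""
    subst = dict(subst)
    for a, v in zip(atom_args, fact):
        if is_var(a):
            if subst.setdefault(a, v) != v:
                return None
        elif a != v:
            return None
    return subst

def eval_body(body, db, subst=None):
    """Yield every substitution that satisfies all body atoms in db."""
    subst = {} if subst is None else subst
    if not body:
        yield subst
        return
    (pred, args), rest = body[0], body[1:]
    for fact in db.get(pred, set()):
        if len(fact) != len(args):
            continue
        extended = match(args, fact, subst)
        if extended is not None:
            yield from eval_body(rest, db, extended)

def least_fixpoint(rules, edb):
    """Apply all rules repeatedly until no new fact is derived."""
    db = {p: set(ts) for p, ts in edb.items()}
    changed = True
    while changed:
        changed = False
        for (hpred, hargs), body in rules:
            # Materialise the matches first so db is not mutated while iterating.
            for s in list(eval_body(body, db)):
                fact = tuple(s[a] if is_var(a) else a for a in hargs)
                if fact not in db.setdefault(hpred, set()):
                    db[hpred].add(fact)
                    changed = True
    return db

def accepts(rules, edb, goal):
    """The program accepts the database iff the goal predicate is non-empty."""
    return bool(least_fixpoint(rules, edb).get(goal))

# Hypothetical program: Goal holds iff "c" is reachable from "a" via Edge.
rules = [
    (("Reach", ("?x", "?y")), [("Edge", ("?x", "?y"))]),
    (("Reach", ("?x", "?z")), [("Reach", ("?x", "?y")), ("Edge", ("?y", "?z"))]),
    (("Goal", ()), [("Reach", ("a", "c"))]),
]
edb = {"Edge": {("a", "b"), ("b", "c")}}
```

Naive evaluation is chosen here for brevity; it recomputes all matches in every round, which is fine for an illustration but not how a practical engine would proceed.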

Lemma 4.10. Let A be a progressive A-automaton. Then there exists a Datalog program $P_A$ and a positive first-order logic sentence $P'_A$ such that L(A) is not empty iff $P_A$ is not contained in $P'_A$. One can construct these in polynomial time in the size of A.

The proof of this lemma is itself quite involved. The basic idea is that $P_A$ enforces the positive constraints of A while $P'_A$ enforces the negative constraints. Recall that in a progressive automaton, the evolution is in a fixed number of stages, based on the number of subqueries satisfied. A stage represents a strongly connected component of the automaton. The extensional database D will have predicates $\mathrm{BackgroundR}_i$ representing the part of relation R that becomes visible to A at the end of each stage i, along with predicates $\mathrm{IntBackgroundR}_i$ representing the data that becomes visible when crossing from one stage to the next. The important intensional predicates $\mathrm{ViewR}_i$ will represent intermediate stages of the predicates $\mathrm{BackgroundR}_i$ within the evolution of each stage. The Datalog program $P_A$ will have rules corresponding to the evolution of $\mathrm{ViewR}_i$ by adding tuples from $\mathrm{BackgroundR}_i$. To ensure that the tuples correspond to some valid binding, $P_A$ will have rules guaranteeing that only tuples that satisfy the appropriate formulas can be added to $\mathrm{ViewR}_i$. We can do this with a Datalog program by adding appropriate intermediate relations, exploiting the fact that the constraints on the guards are positive, and hence representable in non-recursive Datalog.

The role of the positive query $P'_A$ is twofold. First, $P'_A$ will enforce the negated conjunctive queries in the transitions: in particular, $P'_A$ will contain constraints on the relations $\mathrm{BackgroundR}_i$ and $\mathrm{IntBackgroundR}_i$ that enforce that these only contain tuples that satisfy these negated constraints. In this way, whenever the Datalog program adds tuples to the intensional relations, these tuples are guaranteed to satisfy the corresponding negative constraints. The second purpose of $P'_A$ is to enforce that for each i, only one relation among the $\mathrm{IntBackgroundR}_i$ is non-empty. This is important, as these relations contain the tuples that the Datalog program might add when simulating the automaton transitioning from one strongly connected component to the next. On such a transition an A-automaton can only perform one access, and hence the Datalog program should only be able to add tuples from one relation $\mathrm{IntBackgroundR}_i$ into $\mathrm{ViewR}_i$.

In the proof that our construction is correct, we show that the Datalog program $P_A$ can be decomposed into subprograms $P_1, \ldots, P_h$ that correspond to the decomposition of the A-automaton into strongly connected components $C_1, \ldots, C_h$ in the following sense: whenever an A-automaton has a run that ends in its strongly connected component $C_i$, $i \le h$, then the subprogram $P_1 \cup \ldots \cup P_i$ of $P_A$ adds tuples to the intensional database that correspond in a certain way to the tuples that A has obtained using accesses.

Completion of the proof of Theorem 4.6. Let us review 
what we have accomplished thus far; we have reduced ques- 
tions about our logic to non-emptiness of the automata, and 
non-emptiness of an automaton we have reduced to deter- 
mining whether a Datalog program is contained in a positive 
query. To complete the proof of Theorem 4.6 we need the 
following new result, that generalizes a theorem of Chaud- 
huri and Vardi [7]: 

Proposition 4.11. The containment problem of a Datalog program P in a positive first-order sentence $\varphi$, where both P and $\varphi$ may make use of constants, is in 2EXPTIME.

The proof of this result is in the appendix. Theorem 4.6 
follows from the proposition and the reduction given earlier. 

4.2 Restricted Binding Predicates And Reduction To Propositional LTL

We now look for path query languages where the satis- 
fiability problem has lower complexity. We will do this by 
giving up the ability to talk about the exact dataflow from 
data instances to bindings. This will allow us to get veri- 
fication algorithms based on reduction to standard Propo- 
sitional Linear Temporal Logic verification, a well-studied 
problem for which many tools are available [8]. 

For a relational schema Sch, we define the vocabulary $Sch_{0\text{-}Acc}$ as in $Sch_{Acc}$, but instead of the n-ary predicates $\mathrm{IsBind}_{AcM}$, we have only a 0-ary predicate $\mathrm{IsBind}_{AcM}$. A transition $t_i = (I_i, (AcM_i, \bar{b}), I_{i+1})$ is now associated with the relational structure $M'(t_i)$ in which $S_{pre}, S_{post}$ are interpreted as before, and $\mathrm{IsBind}_{AcM}()$ holds exactly if $AcM = AcM_i$. We will now consider AccLTL($FO^{\exists+}_{0\text{-}Acc}$), in which the first-order formulas use only $Sch_{0\text{-}Acc}$. That is, in the logic we can refer to which access was performed, but cannot express anything about the bindings used.

Going back to Examples 2.2 and 2.3, we note that the basic relevance properties are in this language, provided that we do not impose any dataflow restrictions, including any restriction that access paths are grounded. On the other hand, we can still impose the access order restrictions of Example 2.3. We now see that by curtailing the expressiveness, the complexity goes down significantly.

Theorem 4.12. Satisfiability of an AccLTL($FO^{\exists+}_{0\text{-}Acc}$) formula (over all access paths) is PSPACE-complete. The same holds if particular access methods must be exact or idempotent.

Proof. PSPACE-hardness follows from the PSPACE-hardness of the satisfiability problem for LTL formulas over finite words [13]. The upper bound is proven by bounding the size of the underlying data, and then applying results about propositional LTL.

We now prove the upper bound, focusing on the case of general access paths. Let Sch be a schema, and $\varphi$ be a formula of AccLTL($FO^{\exists+}_{0\text{-}Acc}$). First, we demonstrate that if there exists an access path that satisfies $\varphi$ then there exists one where the size of each instance is bounded by a polynomial function in the sizes of $\varphi$ and Sch.

The key is the following "Boundedness Lemma": 

Lemma 4.13. An AccLTL($FO^{\exists+}_{0\text{-}Acc}$) formula $\varphi$ is satisfiable iff there exists a path $\rho$ which satisfies the following properties: 1. The instances in $\rho$ have sizes bounded by a polynomial function in the sizes of $\varphi$ and Sch. 2. The set of bindings used in $\rho$ has size bounded by a polynomial function in the sizes of $\varphi$ and Sch.

Proof. Let some $\varphi$ be given. Suppose that $\varphi$ is satisfiable. Then there exists a path $\rho$ that satisfies $\varphi$. We define the positive sentences of $\varphi$ to be the maximal subsentences of $\varphi$ that belong to $FO^{\exists+}_{0\text{-}Acc}$. Consider the following rewrite rules: for each AcM in Sch we replace the formula $\mathrm{IsBind}_{AcM} \wedge \psi$, where $\mathrm{IsBind}_{AcM}$ is a predicate, by the formula $\psi$. We also replace the formula $\mathrm{IsBind}_{AcM} \vee \psi$, where $\mathrm{IsBind}_{AcM}$ is a predicate, by the formula $\psi$. We denote by $Qf(\varphi)$ the set of $FO^{\exists+}_{0\text{-}Acc}$ sentences that are obtained from a positive sentence of $\varphi$ by inductively applying the above rules until there are no more occurrences of predicates $\mathrm{IsBind}_{AcM}$ in the result.

Let $\{q_1, \ldots, q_m\}$ be the set of sentences appearing in $Qf(\varphi)$ that are satisfied by the last instance $I_n$. Let $\rho_{i_1}, \ldots, \rho_{i_m}$ be the transitions in the path $\rho$ such that $\rho_{i_j}$ is the minimal transition in $\rho$ that satisfies $q_j$. Let $h_j$ be a homomorphism from $q_j$ to $\rho_{i_j}$. We let $(I_{f-1}, AC_f, I_f)$ be the last transition in $\rho$. Let $I'_f$ be the minimal subinstance of $I_f$ such that for all i, $h_i(q_i) \subseteq (I'_f)^{pre} \cup (I'_f)^{post}$, where for any instance I of the original schema, $I^{pre}$ is obtained from I by interpreting relations $R_{pre}$ by the interpretation of R in I, while $I^{post}$ is obtained from I by interpreting relations $R_{post}$ by the interpretation of R in I.

Since we only need to consider witnesses to positive queries, it is easy to check that $I'_f$ can be constructed and has size polynomial in the sizes of $\varphi$ and Sch. We can thus construct a path $\rho'$ whose instances are the intersections of the instances of $\rho$ with the instance $I'_f$; $\rho'$ satisfies $\varphi$, and the sizes of the instances of $\rho'$ are bounded by a polynomial function in the sizes of $\varphi$ and Sch.

We now restrict the bindings used in $\rho'$. Let $\rho$ be a path. An access $(AcM_i, \bar{b}_i)$ is necessary for $\rho$ if new tuples are returned by it (i.e. tuples not in the previous instance within $\rho$), and unnecessary otherwise. Note that if we have a path and we change the binding on some unnecessary access to anything of the appropriate arity, while returning the empty set, then it is still a valid access path.
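
The distinction between necessary and unnecessary accesses, and the rebinding step it permits, can be sketched as follows. The encoding (instances as dicts of sets of tuples, transitions as triples) and the names `is_necessary`, `normalise_unnecessary` and `default_binding` are assumptions of this example, not notation from the proof.

```python
def new_tuples(pre, post):
    """Tuples present in post but not in pre, per relation."""
    return {r: post.get(r, set()) - pre.get(r, set()) for r in post}

def is_necessary(transition):
    """An access is necessary iff it returns at least one new tuple."""
    pre, _access, post = transition
    return any(new_tuples(pre, post).values())

def normalise_unnecessary(path, default_binding):
    """Use one fixed binding per method for every unnecessary access."""
    result = []
    for pre, (method, binding), post in path:
        if not is_necessary((pre, (method, binding), post)):
            binding = default_binding[method]
        result.append((pre, (method, binding), post))
    return result

# Hypothetical path: two unnecessary accesses around one necessary access.
i0 = {"R": set()}
i1 = {"R": {(1, 2)}}
path = [
    (i0, ("m", (7,)), i0),   # returns nothing new
    (i0, ("m", (1,)), i1),   # reveals (1, 2)
    (i1, ("m", (9,)), i1),   # returns nothing new
]
normalised = normalise_unnecessary(path, {"m": (0,)})
bindings_used = {b for _pre, (_m, b), _post in normalised}
```

After normalisation the number of distinct bindings is at most the number of necessary accesses plus the number of access methods, which is the counting argument used in the next paragraph.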

So without loss of generality, we can arrange that the set of bindings used in $\rho'$ consists of the necessary accesses in $\rho'$ plus a single binding for each access method, used in place of every unnecessary access on that method. Therefore the set of bindings is a set of tuples having size bounded by a polynomial function in the sizes of $\varphi$ and the schema. □

Given the lemma, we can now apply the following algorithm, which is easily seen to be in NPSPACE (and hence in PSPACE, by Savitch's theorem):

1. First, we guess a finite sequence of instances $I_1 \ldots I_n$ and a sequence of accesses A, each of polynomial size (with the polynomial given by Lemma 4.13). In the remaining steps, we will check whether there is a witness path using the bindings of these accesses and only these instances.

2. We translate the AccLTL($FO^{\exists+}_{0\text{-}Acc}$) formula $\varphi$ into an ordinary LTL formula $\tilde{\varphi}$ in a propositional alphabet that encodes information about which of the instances and bindings are used. This formula will be constructed so that it is satisfiable over words iff $\varphi$ is satisfiable.

3. Then, we apply any PSPACE algorithm for LTL satisfiability of $\tilde{\varphi}$ over finite words.
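
Step 3 relies on the semantics of LTL over finite words. As a reminder of that semantics only, here is a small Python evaluator that checks whether one given word satisfies a formula. It is not the PSPACE satisfiability procedure referred to above, which is not reproduced here. The tuple encoding of formulas, the strong reading of X (a successor position must exist), and the names `holds`, `big` and `exactly_one` are assumptions of this sketch.

```python
def holds(f, word, i=0):
    """Does formula f hold at position i of the finite word (a list of sets)?"""
    op = f[0]
    if op == "ap":
        return f[1] in word[i]
    if op == "not":
        return not holds(f[1], word, i)
    if op == "and":
        return holds(f[1], word, i) and holds(f[2], word, i)
    if op == "or":
        return holds(f[1], word, i) or holds(f[2], word, i)
    if op == "X":    # strong next: fails at the last position
        return i + 1 < len(word) and holds(f[1], word, i + 1)
    if op == "U":    # f[1] holds until some position where f[2] holds
        return any(holds(f[2], word, k)
                   and all(holds(f[1], word, j) for j in range(i, k))
                   for k in range(i, len(word)))
    if op == "G":
        return all(holds(f[1], word, k) for k in range(i, len(word)))
    if op == "F":
        return any(holds(f[1], word, k) for k in range(i, len(word)))
    raise ValueError(f"unknown operator: {op}")

def big(op, formulas):
    """Fold a non-empty list of formulas with a binary connective."""
    result = formulas[0]
    for f in formulas[1:]:
        result = (op, result, f)
    return result

def exactly_one(props):
    """Propositional formula: exactly one proposition of props is true."""
    cases = []
    for p in props:
        others = [("not", ("ap", q)) for q in props if q != p]
        cases.append(big("and", [("ap", p)] + others))
    return big("or", cases)

# Hypothetical alphabet of three transition propositions.
sigma = ["t0", "t1", "t2"]
one_per_position = ("G", exactly_one(sigma))
good_word = [{"t0"}, {"t2"}, {"t1"}]
bad_word = [{"t0", "t1"}, {"t2"}]
```

The formula `one_per_position` has the shape of a requirement that every position of the word carries exactly one transition proposition.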

We now explain in more detail the translation to ordinary LTL that is the key step in the high-level algorithm above. Fix a sequence $s = I_1 \ldots I_n$ of distinct instances as well as a sequence of accesses A, both of polynomial size. We denote by B the union of the set of bindings used in A and the set $\bigcup_{AcM} \{b_{AcM}\}$ where $b_{AcM}$ is a binding of AcM using some values appearing in B.

We associate propositions with transitions of any of the following forms:

• Transitions of the form $(I_i, (AcM, \bar{b}), I_i)$ where $\bar{b}$ is in B and compatible with AcM.

• Transitions of the form $(I_i, A_i, I_{i+1})$.

The set of transitions of the above forms is denoted T(I, B). For each i, we denote by T(i) the set of transitions of the form $(I_i, (AcM, \bar{b}), I_i)$, and by $t_{i,A}$ the transition $(I_i, A_i, I_{i+1})$. For each i, we denote by P(i) the set of propositions associated with the transitions in T(i), and by $p_{i,A}$ the proposition associated with $t_{i,A}$. The set of all such propositions is denoted $\Sigma$. The words described by $\tilde{\varphi}$ are over the alphabet $2^{\Sigma}$. Intuitively, each letter of a word is used to describe a transition $(I, (AcM, \bar{b}), I')$.

We now describe the construction of $\tilde{\varphi}$.

First, we describe some "sanity axioms" stating that a run associated with $\tilde{\varphi}$ really corresponds to some access path. This requires:

• Every position has exactly one proposition of $\Sigma$.

• The order of the instances in s is respected. This is expressed by the formula:



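$$\bigwedge_{i,\; p \in P(i)} G\Big(p \rightarrow \Big(\bigvee_{p' \in P(i)} p'\Big)\, U\, p_{i,A}\Big) \;\wedge\; \bigwedge_{i} \big[\ldots\big]_{p'' \in P(i+1)} \;\wedge\; \Big(\bigvee_{p''' \in P(0)} p''' \vee p_{0,A}\Big)$$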

Next we rewrite $\varphi$ to $\tilde{\varphi}$ by replacing each positive sentence q of $\varphi$ by the disjunction, over all the transitions above that satisfy it, of the associated propositions $p \in \Sigma$.

We claim that $\tilde{\varphi}$ is satisfiable over ordinary words iff $\varphi$ is satisfiable over access paths that conform to the sequence s and the bindings in B. The direction from right to left requires taking an access path and performing the obvious propositional abstraction. In the other direction, we take a propositional word $w_1 \ldots w_n$ satisfying $\tilde{\varphi}$. The first sanity axiom implies that exactly one transition proposition p is associated with each $w_i$. The second sanity axiom implies that the instance reached in the transition associated with $w_i$ is the same as the initial instance of the transition associated with $w_{i+1}$. One can check that this gives the required access path for $\varphi$.

Restricting LTL operators. Let LTL$_X$ be the subset of LTL that only uses the temporal operator X. We denote by AccLTL(X)($FO^{\exists+}_{0\text{-}Acc}$) the corresponding sublanguage of AccLTL($FO^{\exists+}_{Acc}$).

AccLTL(X)($FO^{\exists+}_{0\text{-}Acc}$) is extremely limited in expressiveness, since it can only talk about paths of some fixed length. However, there are properties for which such small paths are sufficient. Consider Example 2.3. It is easy to see that Q is LTR over all accesses iff it is LTR over access paths of size |Q|: a counterexample to long-term relevance has only polynomial length. But LTR over small paths can be expressed in AccLTL(X)($FO^{\exists+}_{0\text{-}Acc}$). Thus AccLTL(X)($FO^{\exists+}_{0\text{-}Acc}$) is sufficient to tell whether an access might have an impact on answering a query, but without taking into account even the most basic dataflow restrictions on paths.

Theorem 4.14. Satisfiability of AccLTL(X)($FO^{\exists+}_{0\text{-}Acc}$) is $\Sigma^P_2$-complete, even when certain accesses are restricted to be exact or idempotent.

Hardness. Non-containment of positive relational queries, where positions can be restricted to have finite (i.e. enum) datatypes, can be reduced to the unsatisfiability problem of either language; this problem is known to be $\Pi^P_2$-hard.

Upper-Bound. Let an AccLTL(X)($FO^{\exists+}_{0\text{-}Acc}$) formula $\varphi$ be given. We first note that Lemma 4.13 also holds for the logic AccLTL(X)($FO^{\exists+}_{0\text{-}Acc}$). Using this we can reduce to propositional LTL$_X$, which has a satisfiability problem in NP. In the reduction we will again guess a small number of small instances and bindings, and we will also guess which positive queries of $\varphi$ will be true; this guess will then be verified via a sequence of NP (for queries guessed to be true) and co-NP (for queries guessed to be false) subroutines. We can then rewrite the original formula $\varphi$ to an LTL$_X$ formula that is satisfiable iff $\varphi$ is satisfiable on a sequence based on the guessed instances and bindings.

5. EXTENSIONS AND LIMITS 

We look at the impact of two natural extensions on our 
decidability results: allowing inequalities and branching for- 
mulas. 

5.1 Extension To Inequalities 

Our results on decidable fragments did not use inequali- 
ties, and inequalities are useful for expressing data integrity 
constraints. The most obvious example involves keys and 
functional dependencies, as discussed in Example 2.4. 

By making a straightforward modification of the proofs without inequalities, we can see that inequalities add nothing to the complexity of AccLTL($FO^{\exists+}_{0\text{-}Acc}$) and its sublanguages.



Theorem 5.1. Letting $FO^{\exists+,\neq}_{0\text{-}Acc}$ be the language of positive queries with inequalities over the restricted vocabulary with only the 0-ary predicates $\mathrm{IsBind}_{AcM}$, we have that

• satisfiability of AccLTL($FO^{\exists+,\neq}_{0\text{-}Acc}$) is in PSPACE (and hence PSPACE-complete by Theorem 4.12)

• satisfiability of AccLTL(X)($FO^{\exists+,\neq}_{0\text{-}Acc}$) is in $\Sigma^P_2$ (hence $\Sigma^P_2$-complete by Theorem 4.14)



Using the language above, one can express relevance or 
containment in the presence of functional dependencies, ac- 
cess order constraints, and disjointness constraints, but not 
dataflow constraints. 

For the language AccLTL+, shown decidable in Theorem 4.2, inequalities make a dramatic difference. The proof of the theorem below shows that we cannot capture both dataflow restrictions like groundedness and rich integrity constraints such as functional dependencies while retaining decidability. The proof also shows that many extensions of AccLTL+ with aggregation (basically, any that are expressive enough to capture FDs) will be undecidable.

Theorem 5.2. For binding-positive AccLTL($FO^{\exists+,\neq}_{Acc}$), satisfiability is undecidable.

Proof. Again we reduce the problem of implication of functional dependencies (fd's) and inclusion dependencies (id's) for relational databases to the unsatisfiability problem for AccLTL+ extended with inequalities.

Let $\Gamma$ be a set of inclusion and functional dependencies, and $\sigma$ be a functional dependency over Sch.

The approach to the reduction is similar to that in Theorem 3.1. We will make iterative accesses to a successor relation of a total order over the tuples. We will also access relations Beg(R) and End(R), and verify that they contain the first and the last tuples of relation R according to the order. While iterating through the relations according to the successor relation, the satisfaction of the different fd's and id's and the failure of $\sigma$ are verified. The satisfaction and failure of fd's can be reduced to the satisfaction of a boolean combination of conjunctive queries with inequalities; the successor relation is not needed. The satisfaction of an inclusion dependency id whose source is a relation R is where we use the successor relation and the iteration technique of Theorem 3.1. Again, it is easy to check an inclusion dependency for a source relation consisting of only a single tuple, since this requires only existential quantification. We verify an id on source relation R by checking for witnesses for one tuple in the source of the dependency at a time, iterating on the tuples according to the successor relation. We will use a new predicate CheckIncDep(id) whose arity is the arity of R. CheckIncDep(id)(t) holding at some instance indicates that t has been verified to satisfy the inclusion dependency id. This will be done in a "loop" (an "until" in the logic) in which we look for a tuple t whose predecessor in the order satisfies CheckIncDep(id), and which satisfies the inclusion dependency; when we find such a tuple, we perform an access to CheckIncDep(id) on it. At the end of this "loop", we check that the final tuple in the ordering satisfies CheckIncDep(id). □

The reader may want to look at Figure 2 to see how the languages with inequalities relate to the languages defined previously.

5.2 Branching Time Formulas

Thus far we have discussed only linear time properties of the LTS of a schema with access relations. What about
branching time logics, which can consider the relationship of 
multiple paths? For example, a branching time logic could 
express that we have reached a point where no further in- 
formation about boolean query Q can be obtained without 
guessing values to enter into forms - e.g. there are possible 
worlds consistent with the known facts where Q is true and 
also consistent worlds where Q is false, but the truth of Q 
can not be revealed by any further sequence of grounded ac- 
cesses. Unfortunately, we will show that even very limited 
branching time expressiveness leads to undecidability. 

Let L be a fragment of first-order logic over the smallest vocabulary we have considered thus far: two copies $S_{pre}, S_{post}$ of each relation symbol S and the proposition $\mathrm{IsBind}_{AcM}$.

We will consider a small fragment of branching time logic built up from L-formulas, analogously to the way we built up AccLTL formulas over sentences of L in the linear time logic. Traditional branching time logic allows the combination of path quantification with modal operators. In our setting we will consider a very simple kind of branching, which looks ahead only one step. We will refer to it as CTL$_{EX}$(L), but instead of CTL we might as easily have said "basic modal logic" or Hennessy-Milner Logic [13], since we only need the power of the most basic existential modality to get undecidability. CTL$_{EX}$(L) has the rules: every L sentence is a formula, boolean combinations of formulas are formulas, and if $\varphi$ is a formula then EX$\varphi$ (in modal logic notation, $\Diamond\varphi$) is a formula.

The semantics is defined as a relation $(S, t) \models \varphi$, where t is a transition (I, AC, I') in the labelled transition system S associated with a schema Sch. When $\varphi$ is an L formula, this holds iff the relational structure associated to t, M'(t), satisfies $\varphi$ in the usual sense of first-order logic. The semantics of boolean operators is the usual one. Finally, $(S, t) \models EX\varphi$ iff there is a successor t' of t such that $(S, t') \models \varphi$. Note that instead of referring to CTL here, we could have used basic modal logic or Hennessy-Milner Logic. Note also that Deutsch et al. [12] have shown undecidability for some branching time logics over LTS's associated with a similar model of relational transducers, but in their case the logics (e.g. Theorem 4.14 of [12]) allow one to describe properties of the input (analogous to our larger signature $Sch_{Acc}$), while here we can only describe the access propositionally.

We show that even this restricted logic is undecidable, even when the base formulas are existential.

Theorem 5.3. Satisfiability of CTL$_{EX}$($FO^{\exists+}_{0\text{-}Acc}$) is undecidable.

Proof. We reduce from the problem of implication of 
a functional dependency (FD) from a set of functional de- 
pendencies and inclusion dependencies (IDs) for relational 
databases. This is known to be undecidable [6]. 

Let $\Gamma$ be a set of inclusion and functional dependencies over a relational schema Sch and $\sigma$ an FD. For simplicity, we will assume all positions in the schema have the same type (say, integer type). We will first extend Sch with additional relations, along with access patterns.

For each relation R of Sch, we have an access method Fill$_R$ on R with no inputs. Thus each access (Fill$_R$, $\emptyset$) returns an essentially random configuration of R. We also have additional relations Chk$^{FD}$(R), having twice the arity of R, and CheckIncDep(R), having the same arity as R. We have boolean access methods on all of these additional relations, that is, methods where all positions are in the input. Our reduction will create a formula $\psi(\Gamma, \sigma)$ of the form:

$$EX\Big(\mathrm{Fill}_{R_1} \wedge EX\big(\cdots \wedge EX\big(\mathrm{Fill}_{R_n} \wedge \varphi_{\neg\sigma} \wedge \bigwedge_{fd \in \Gamma} \varphi_{fd} \wedge \bigwedge_{id \in \Gamma} \varphi_{id}\big)\big)\Big)$$

where $\varphi_{fd}$, $\varphi_{id}$, and $\varphi_{\neg\sigma}$ will be defined below, but we explain their mission now. For each functional dependency $fd \in \Gamma$, the formula $\varphi_{fd}$ will hold on a transition t = (I, AC, I') exactly when fd holds on the restriction of I' to the schema predicates from Sch, and similarly for $\varphi_{id}$. The formula $\varphi_{\neg\sigma}$ checks that I' does not satisfy the functional dependency $\sigma$. Thus this formula will imply that the configuration is a witness showing that $\Gamma$ does not imply $\sigma$.
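
The three conditions that the subformulas are meant to enforce can also be checked directly on one explicit candidate instance, without going through the modal encoding. The Python sketch below does this under assumptions of our own: relations are sets of tuples, positions are 0-based indexes, an FD is a triple (relation, source positions, target position), and an ID is a tuple (source relation, source positions, target relation, target positions). It only tests a single given instance and says nothing about whether a witness exists.

```python
def satisfies_fd(rows, lhs, rhs):
    """FD lhs -> rhs: rows agreeing on lhs positions agree on position rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True

def satisfies_id(src_rows, src_pos, dst_rows, dst_pos):
    """ID: every projection of a source row occurs as a target projection."""
    targets = {tuple(row[i] for i in dst_pos) for row in dst_rows}
    return all(tuple(row[i] for i in src_pos) in targets for row in src_rows)

def is_counterexample(instance, fds, ids, sigma):
    """instance satisfies all of Gamma (fds and ids) but violates the FD sigma."""
    rel, lhs, rhs = sigma
    return (all(satisfies_fd(instance[r], l, p) for r, l, p in fds)
            and all(satisfies_id(instance[r], a, instance[s], b)
                    for r, a, s, b in ids)
            and not satisfies_fd(instance[rel], lhs, rhs))

# Hypothetical schema: R has three positions, S has one.
instance = {"R": {(1, 2, 3), (1, 2, 4)}, "S": {(1,)}}
fds = [("R", (0,), 1)]                 # position 0 determines position 1
ids = [("R", (0,), "S", (0,))]         # R[0] is included in S[0]
sigma = ("R", (0,), 2)                 # position 0 determines position 2
```

Here `instance` satisfies the FD and the ID in the hypothetical set of dependencies while violating `sigma`, so it plays the role of the witness configuration described above.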

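We now explain how the different formulas are built. Let fd = R : P → p where P are positions of relation R and p is a position of R. The formula $\varphi_{fd}$ will be:

$$AX\Big(\Big(\exists \bar{x}\bar{y}\; \mathrm{Chk}^{FD}(R)_{post}(\bar{x},\bar{y}) \wedge \bigwedge_{i \in P} x_i = y_i \wedge R_{post}(\bar{x}) \wedge R_{post}(\bar{y})\Big) \Rightarrow \exists \bar{x}'\bar{y}'\; \mathrm{Chk}^{FD}(R)_{post}(\bar{x}',\bar{y}') \wedge x'_p = y'_p\Big)$$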

Here we use the derived "box" modality $AX\varphi = \neg EX \neg\varphi$. Note that $\varphi_{fd}$ occurs in formula $\psi(\Gamma, \sigma)$ in a context where we know that only accesses to the $R_i$ have been done, hence only in contexts where Chk$^{FD}$(R) must be empty. Since the only access methods for the relations Chk$^{FD}$(R) are boolean, this means that after one transition we can have at most one tuple in $\mathrm{Chk}^{FD}(R)_{post}(\bar{x}, \bar{y})$. Thus doing a modality AX followed by a test that $\mathrm{Chk}^{FD}(R)(\bar{x},\bar{y}) \wedge R_{post}(\bar{x}) \wedge R_{post}(\bar{y})$ holds amounts to testing an arbitrary pair $\bar{x}, \bar{y}$ satisfying R prior to the access. The formula thus asserts that for any such pair of tuples in R, if they agree on all positions in the source of the FD, they agree on the target of the FD.
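We can use a similar trick with the formula $\varphi_{\neg\sigma}$:

$$EX\Big(\exists \bar{x}\bar{y}\; \mathrm{Chk}^{FD}(R)_{post}(\bar{x},\bar{y}) \wedge \bigwedge_{i \in P} x_i = y_i \wedge R_{post}(\bar{x}) \wedge R_{post}(\bar{y}) \wedge \neg\big(\exists \bar{x}'\bar{y}'\; \mathrm{Chk}^{FD}(R)_{post}(\bar{x}',\bar{y}') \wedge x'_p = y'_p\big)\Big)$$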

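Now fix an id $R[A_1, \ldots, A_n] \subseteq S[B_1, \ldots, B_n]$, and we define $\varphi_{id}$ to be

$$AX\Big(\Big(\mathrm{IsBind}_{CheckIncDep(R)} \wedge \exists \bar{x}\, \big(R_{post}(\bar{x}) \wedge \mathrm{CheckIncDep}(R)_{post}(\bar{x})\big)\Big) \Rightarrow EX\Big(\mathrm{IsBind}_{CheckIncDep(S)} \wedge \exists \bar{x}\, \mathrm{CheckIncDep}(R)_{post}(\bar{x}) \wedge \exists \bar{y}\, \mathrm{CheckIncDep}(S)_{post}(\bar{y}) \wedge \bigwedge_{i} x_{A_i} = y_{B_i}\Big)\Big)$$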

This states that whenever we do a "test access" that returns an element of R, there is some access we can do immediately afterwards in the LTS that reveals a matching tuple in S. As in the case of $\varphi_{fd}$ above, the accesses we perform are boolean, and hence cannot be creating any new elements of S; thus the revealed match must have been in the configuration prior
to the access. □

Language                                          Complexity            DjC   FD    DF    AccOr
AccLTL($FO^{\exists+,\neq}_{Acc}$)                undecidable           Yes   Yes   Yes   Yes
AccLTL($FO^{\exists+}_{Acc}$)                     undecidable           Yes   No    Yes   Yes
AccLTL+                                           in 3EXPTIME           Yes   No    Yes   Yes
A-automata                                        2EXPTIME-compl.       Yes   No    Yes   Yes
AccLTL($FO^{\exists+}_{0\text{-}Acc}$)            PSPACE-compl.         Yes   No    No    Yes
AccLTL($FO^{\exists+,\neq}_{0\text{-}Acc}$)       PSPACE-compl.         Yes   Yes   No    Yes
AccLTL(X)($FO^{\exists+,\neq}_{0\text{-}Acc}$)    $\Sigma^P_2$-compl.   Yes   Yes   No    No

Table 1: Complexity and application examples for path specifications.




Figure 2: Inclusions between language classes. 



6. CONCLUSIONS AND RELATED WORK 

In this work we introduced the notion of querying the 
access paths that are allowed by a schema. We presented 
decidable specification languages for doing this, and gave un- 
decidability results showing several limits of such languages. 
Figure 2 shows the inclusions of the languages considered 
in the paper, excluding those for branching time. All of 
the containments shown in the diagram are straightforward. 
The containment of $FO^{\exists+}_{0\text{-}Acc}$ in AccLTL$^+$ does require one
to deal with the fact that $FO^{\exists+}_{0\text{-}Acc}$ sentences are not
required to be binding-positive. The inclusion follows by first
rewriting negated 0-ary $IsBind_{AcM}$ predicates using the rule
$\neg IsBind_{AcM} = \bigvee_{AcM' \neq AcM} IsBind_{AcM'}$, then replacing the
0-ary predicate by existentially-quantified n-ary predicates.
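
The first rewriting step can be sketched as a small string-level transformation. This is only an illustration under stated assumptions: access methods are represented by plain names, every transition is taken to use exactly one access method (so "not this method" means "one of the others"), and the function name `rewrite_negated_isbind` and the `|` symbol for disjunction are hypothetical choices rather than the paper's syntax.

```python
def rewrite_negated_isbind(method, all_methods):
    """Rewrite the negation of IsBind_method as a disjunction of the
    IsBind predicates of all other access methods in the schema."""
    if method not in all_methods:
        raise ValueError(f"unknown access method: {method}")
    others = [m for m in all_methods if m != method]
    if not others:
        return "false"  # an empty disjunction is unsatisfiable
    return " | ".join(f"IsBind_{m}" for m in others)

# Hypothetical schema with three access methods.
print(rewrite_negated_isbind("AcM1", ["AcM1", "AcM2", "AcM3"]))
# IsBind_AcM2 | IsBind_AcM3
```

The result contains no negated binding predicate, which is what the second step (replacing 0-ary predicates by existentially quantified n-ary ones) needs in order to land in a binding-positive formula.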

All the inclusions in the diagram also turn out to be strict. 
We omit the proofs for this, which use standard techniques: 
e.g. A-automata can express parity conditions on the length 
of paths, which first-order languages like AccLTL$^+$, or even
AccLTL($FO^{\exists+}_{Acc}$), cannot do.
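
To see why a parity condition is natural for an automaton model, consider the following minimal sketch. It is a plain two-state finite automaton over an abstract sequence of accesses, not an A-automaton in the sense of this paper (it has no guards on the relational content of a transition). The function name `accepts_even_length` and the string encoding of accesses are assumptions made for illustration.

```python
def accepts_even_length(path):
    """Run a two-state automaton that flips its state on every
    access and accepts exactly the paths of even length."""
    state = "even"
    for _access in path:
        state = "odd" if state == "even" else "even"
    return state == "even"

# Hypothetical access paths, each access written as a string.
print(accepts_even_length([]))                        # True
print(accepts_even_length(["AcM1(a)"]))               # False
print(accepts_even_length(["AcM1(a)", "AcM2(b)"]))    # True
```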

Table 1 shows the complexity of satisfiability for each 
specification formalism, along with application examples. 
DjC indicates that the language can express relevance of 
an access in the presence of disjointness constraints, while 
FD, DF, AccOr refer to functional dependencies, dataflow re- 
strictions, and access order restrictions, respectively. 

Our work leaves open a number of questions concerning 
the logics we study - for example, we leave open the exact
complexity of AccLTL$^+$, which lies between double- and
triple-exponential time. We also do not have tight bounds
for our more restricted fragments (e.g. with only the 0-ary
version of $IsBind_{AcM}$) in the important case of grounded
access paths.

Although this is, to the best of our knowledge, the first
work on languages for describing access paths through a
schema with binding patterns, there is a strong formal
connection to work on verifying data-driven services, as well as
other work in the area of hidden Web querying. We review
the closest connections below.

Data-driven services. Our work is closely related to a line 
of research on relational transducers and models for data-driven
services, beginning with Abiteboul et al. [2], and
continuing through work of Spielmann [19], Deutsch, Su and
Vianu (e.g. [12]), Fritz et al. [14], and Deutsch et al. [10].
All of these works deal with specification languages for tran- 
sition systems in which transitions may involve the consum- 
ing of relational inputs from an external environment, the 
production of output tuples, and the modification of internal 
state (perhaps in the form of an additional relational store). 
In our application, we talk of accesses rather than inputs 
from an environment, with a response consisting of reveal- 
ing a hidden database instance, rather than updating an 
internal store. But in the results of this paper, one can just 
as easily think of identifying the hidden Web database with 
an internal store, with the accesses being non-deterministic 
inserts into the store. 

Nevertheless, the logics that arise naturally in our setting 
appear orthogonal to those studied in prior work. The initial 
Abiteboul et al. paper [2] focused on "Spocus transducers"
(semi-positive output and cumulative state) which take full 
relational inputs, with their internal relations only accumu- 
lating them. A direct comparison with our model is difficult, 
since we do not have a notion of "output" - but if we restrict 
Spocus transducers to boolean output and singleton inputs, 
they are not as powerful as our model, since in our case 
the internal state can be modified in non-trivial ways. [2]
proves an undecidability result for an extension of Spocus
transducers in which the inserted data is allowed to be a
projection of the "input relations" (Prop. 3.1 of [2]). The
technique applied is similar to that in Theorem 5.2, but pro- 
jection is orthogonal to the update given by access methods. 
In our terms, this extension would amount to having the in- 
formation added to the hidden database be a projection of 
the accessed relations. On the other hand, the addition of 
projection does not give the ability to model access meth- 
ods, which restrict the input relations by requiring them to 
satisfy a selection criterion. 

Later works [19, 12, 10, 14] deal with transducers that 
can delete as well as insert into their internal state. A key 
restriction is input-guardedness, which ensures decidability
[12] - input-guardedness requires quantifications to be restricted
to tuples generated from the environmental inputs.
The analogous restriction in our setting would be to restrict
quantification to the bindings, which would be much weaker
than the logics we consider. Thus our decidability and com- 
plexity results are not subsumed by these works. On the 
other hand, guarded quantification over relational inputs is 
not supported by our logics, and hence we do not claim to 
subsume results in these works. In addition, [10] allows a
built-in linear order on the domain, which we do not consider
for our largest logics. Later work by Damaggio et al.
considers even richer signatures, including arithmetic [9].

Hidden Web querying. Our work is directly inspired 
by previous results on static analysis of schemas with lim- 
ited access patterns, a line of work tracing back (at least) 
to Ullman's work [20] and Rajaraman et al. [18], continuing
with Chang/Li's work in the early 2000s [16, 15],
Ludäscher/Nash's and Deutsch et al.'s work in the mid-2000s
[17, 11], and Calì et al. [5]. All of them deal in
one way or another with what sequences can occur within 
a sequence with limited access patterns. For example, the 
question of whether a query can always be answered us- 
ing exact grounded access paths - the focus of most of these 
works above - can be expressed as a property of the LTS. Exact
complexity bounds for query answering can be derived from the
works above. Containment under access patterns has also
been studied, particularly in [5], which establishes a coNEXPTIME
upper bound for conjunctive queries. [3] proves
a matching coNEXPTIME lower bound for containment for
conjunctive queries, and a co-2NEXPTIME upper bound for
positive queries. [3] also defines the notion of long-term relevance
(LTR). They prove a $\Sigma_2^P$-completeness result for LTR
over general access paths ("independent accesses", in their
terminology) while providing a NEXPTIME-completeness
result for conjunctive queries and a 2NEXPTIME bound
for LTR of positive queries over grounded access paths
("dependent accesses").

Our work provides a general framework where we can 
express properties of access paths, including containment, 
LTR, their combinations, and their restrictions to constraints. 
By providing these within a boolean closed logic, we give 
a flexible means of combining properties that one wishes 
to verify. Our 2EXPTIME result for non-emptiness of A- 
automata gives a bound on containment under access pat- 
terns and long-term relevance, as mentioned in the discus- 
sion after Theorem 4.6. This is better than the prior bounds 
from [5, 3]. 

Note that [3] also makes some erroneous claims: 1. A
co-2NEXPTIME lower bound for containment of positive
queries under access patterns, which is at odds (relative to
complexity-theoretic hypotheses) with our 2EXPTIME upper
bound. 2. A coNEXPTIME upper bound for containment
of UCQs under general access patterns. The proof
given there only works for schemas with a single access per
relation, while in subsequent work, we have shown that the
problem is 2EXPTIME-hard if the single-access restriction
is dropped.